Copyright Eli Lilly and Company

The drake R package

Reproducible computation at scale

Will Landau

Why drake?


Data analysis has interconnected steps.

Each update can invalidate other work.

Do you hunt for all the changes yourself?

  • Messy and prone to human error.
  • Not reproducible.

https://openclipart.org/detail/216179/messy-desk

Do you rerun everything from scratch?

  • Takes too long.
  • Too frustrating.

https://openclipart.org/detail/275842/sisyphus-overcoming-silhouette

Pipeline toolkits for large computation

Small example drake project: GDP

  • Which quantity is a better predictor of GDP per capita, life expectancy or population?
    • Gapminder data: observations on multiple countries from 1952 to 2007.
    • Bayesian regression with rstanarm: straightforward inference and interpretation.
  • But Bayesian methods can be computationally expensive!

GDP analysis pipeline

The drake plan: steps of the workflow.

The plan is just a data frame.
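A minimal sketch of what such a plan can look like. The dataset (mtcars) and the helper fit_model() are stand-ins chosen for illustration, not the GDP code from the talk; only drake_plan() is drake's own function here.

```r
library(drake)

# Hypothetical helper, for illustration only: a simple linear model.
fit_model <- function(data) {
  lm(mpg ~ wt, data = data)
}

# Each argument to drake_plan() declares one target as name = command.
# Nothing runs yet; the commands are only recorded.
plan <- drake_plan(
  data = mtcars,            # stand-in dataset (assumption)
  model = fit_model(data),  # depends on the target `data`
  coefs = coef(model)       # depends on the target `model`
)

# The plan is an ordinary data frame with one row per target.
is.data.frame(plan)
plan$target
```

Because the plan is a data frame, it can be printed, filtered, and combined with the usual tools before anything is built.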

Support for the plan

Run your workflow.
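Running the workflow is a single call to make(). The sketch below assumes a small stand-in plan on mtcars and a throwaway cache made with new_cache(), so it does not touch an existing .drake/ folder; by default drake would store results in .drake/ in the working directory.

```r
library(drake)

# Throwaway cache (assumption: used here only to keep the example isolated).
cache <- new_cache(tempfile())

plan <- drake_plan(
  data = mtcars,                      # stand-in dataset (assumption)
  model = lm(mpg ~ wt, data = data),
  coefs = coef(model)
)

# Build every target in dependency order and store the results in the cache.
make(plan, cache = cache)

# List what the cache now holds.
built <- cached(cache = cache)
built
```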

Output file report.html

Get targets from the cache.
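Finished targets can be read back with readd() and loadd(). A minimal sketch, again with a stand-in plan and a throwaway cache (both assumptions, not the talk's GDP project):

```r
library(drake)

cache <- new_cache(tempfile())  # throwaway cache for the example

plan <- drake_plan(
  data = mtcars,                     # stand-in dataset (assumption)
  model = lm(mpg ~ wt, data = data)
)
make(plan, cache = cache)

# readd() returns the stored value of a single target.
model <- readd(model, cache = cache)
class(model)

# loadd() assigns targets into the current environment under their own names.
loadd(data, cache = cache)
nrow(data)
```

With the default cache, the cache argument can be dropped: readd(model) and loadd(data) look in .drake/ on their own.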

Find things to improve.

Go back and change a function.

Which targets need an update?

vis_drake_graph()
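The check for stale targets can be sketched as follows. This assumes a recent drake release in which outdated() and vis_drake_graph() accept the plan directly (older releases expected a drake_config() object); fit_model() and the mtcars data are hypothetical stand-ins.

```r
library(drake)

cache <- new_cache(tempfile())  # throwaway cache for the example

# Hypothetical modeling function, for illustration only.
fit_model <- function(data) lm(mpg ~ wt, data = data)

plan <- drake_plan(
  data = mtcars,
  model = fit_model(data),
  coefs = coef(model)
)
make(plan, cache = cache)

# Go back and change the function: add a second predictor.
fit_model <- function(data) lm(mpg ~ wt + hp, data = data)

# drake notices the change and reports the targets that depend on it.
stale <- outdated(plan, cache = cache)
stale

# The interactive graph marks outdated targets (needs the visNetwork package).
if (interactive() && requireNamespace("visNetwork", quietly = TRUE)) {
  vis_drake_graph(plan, cache = cache)
}
```

Here `data` stays up to date because it does not depend on fit_model(), while `model` and everything downstream of it are flagged.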

Run only the parts that need to change.
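One way to see that a second make() skips up-to-date targets is to let each step append its name to a log file. Everything except make(), drake_plan() and new_cache() is a hypothetical helper written for this sketch.

```r
library(drake)

cache <- new_cache(tempfile())  # throwaway cache for the example
log_file <- tempfile()          # records which steps actually run

# Hypothetical helpers, for illustration only.
record <- function(name) {
  cat(name, "\n", sep = "", file = log_file, append = TRUE)
}
load_data <- function() {
  record("data")
  mtcars
}
fit_model <- function(data) {
  record("model")
  lm(mpg ~ wt, data = data)
}

plan <- drake_plan(
  data = load_data(),
  model = fit_model(data)
)
make(plan, cache = cache)  # first run: builds data, then model

# Change only the modeling function, then run again.
fit_model <- function(data) {
  record("model")
  lm(mpg ~ wt + hp, data = data)
}
make(plan, cache = cache)  # second run: rebuilds model only

runs <- readLines(log_file)
runs
```

The log should show `data` once and `model` twice: the data step was skipped on the second run because nothing it depends on changed.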

Life expectancy has a stronger association.

Reproducibility is about confidence and trust.


  • Tangible evidence that your results match the code and data at hand:
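A sketch of what that evidence can look like: once the targets are built, a second make() finds nothing to do and outdated() comes back empty. The plan is a stand-in on mtcars, and the call outdated(plan, ...) assumes a recent drake release.

```r
library(drake)

cache <- new_cache(tempfile())  # throwaway cache for the example

plan <- drake_plan(
  data = mtcars,                     # stand-in dataset (assumption)
  model = lm(mpg ~ wt, data = data)
)

make(plan, cache = cache)  # builds both targets
make(plan, cache = cache)  # nothing has changed, so nothing is rebuilt

# No target is out of sync with the code and data it came from.
still_stale <- outdated(plan, cache = cache)
still_stale
```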

Scale up to many targets

  • Predict gross state product using an econometrics dataset.

Scale up to many targets

  • Experimental interface coming to drake >= 7.0.0.

Scale up to many targets.

## # A tibble: 170 x 2
##    target              command                                             
##    <chr>               <chr>                                               
##  1 model_state_year_p… fit_gsp_model(gsp ~ state + year + pcap, data = Ecd…
##  2 model_state_year_h… fit_gsp_model(gsp ~ state + year + hwy, data = Ecda…
##  3 model_state_year_w… fit_gsp_model(gsp ~ state + year + water, data = Ec…
##  4 model_state_year_u… fit_gsp_model(gsp ~ state + year + util, data = Ecd…
##  5 model_state_year_pc fit_gsp_model(gsp ~ state + year + pc, data = Ecdat…
##  6 model_state_year_e… fit_gsp_model(gsp ~ state + year + emp, data = Ecda…
##  7 model_state_year_u… fit_gsp_model(gsp ~ state + year + unemp, data = Ec…
##  8 model_state_pcap_h… fit_gsp_model(gsp ~ state + pcap + hwy, data = Ecda…
##  9 model_state_pcap_w… fit_gsp_model(gsp ~ state + pcap + water, data = Ec…
## 10 model_state_pcap_u… fit_gsp_model(gsp ~ state + pcap + util, data = Ecd…
## # … with 160 more rows
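A plan of this shape can be generated rather than typed by hand. The sketch below uses the transform interface of drake >= 7.0.0 on a much smaller grid of predictors; it is an assumption about how such a plan could be written, not the code behind the 170-row output above. fit_gsp_model() and Ecdat::Produc come from that output; drake_plan() only records the commands, so nothing is fitted here and neither needs to be available.

```r
library(drake)

plan <- drake_plan(
  # cross() creates one model target per combination of x1 and x2.
  model = target(
    fit_gsp_model(gsp ~ state + x1 + x2, data = Ecdat::Produc),
    transform = cross(x1 = c(pcap, hwy), x2 = c(water, util))
  ),
  # combine() gathers all the model targets into a single downstream target.
  all_models = target(
    list(model),
    transform = combine(model)
  )
)

plan$target
nrow(plan)
```

Each generated target is named after the values substituted into its command, which is how names such as model_state_year_pcap arise in the larger plan.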

Persistent parallel workers
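With persistent workers, a fixed pool of processes stays alive and pulls targets as they become ready. In drake this corresponds to parallelism = "clustermq". The sketch below assumes the clustermq package is installed and uses its local "multicore" scheduler instead of a real cluster; on systems where that is not available it falls back to the default sequential build. The plan itself is a toy example.

```r
library(drake)

cache <- new_cache(tempfile())  # throwaway cache for the example

# Toy plan: `a` and `b` are independent, so two workers can build them at once.
plan <- drake_plan(
  a = sqrt(16),
  b = sqrt(25),
  total = a + b
)

if (requireNamespace("clustermq", quietly = TRUE) &&
    .Platform$OS.type != "windows") {
  # Run the workers as local forked processes (no cluster scheduler needed).
  options(clustermq.scheduler = "multicore")
  make(plan, parallelism = "clustermq", jobs = 2, cache = cache)
} else {
  # Fallback: build the targets one after another.
  make(plan, cache = cache)
}

total <- readd(total, cache = cache)
total
```

On a cluster, the same call would be paired with a different clustermq scheduler option and a template file for the scheduler in use.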

Transient parallel workers
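With transient workers, each target gets its own short-lived worker. In drake this corresponds to parallelism = "future". The sketch below assumes the future package is installed and uses local background R sessions; a distributed setup would swap in a cluster backend for future instead. The drake plan is named my_plan here to avoid confusion with future::plan().

```r
library(drake)

cache <- new_cache(tempfile())  # throwaway cache for the example

# Toy plan: `a` and `b` are independent and can run on separate workers.
my_plan <- drake_plan(
  a = sqrt(16),
  b = sqrt(25),
  total = a + b
)

if (requireNamespace("future", quietly = TRUE)) {
  # Launch workers as background R sessions on the local machine.
  future::plan(future::multisession, workers = 2)
  make(my_plan, parallelism = "future", jobs = 2, cache = cache)
  future::plan(future::sequential)  # restore the default backend
} else {
  # Fallback: build the targets one after another.
  make(my_plan, cache = cache)
}

total <- readd(total, cache = cache)
total
```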

Distributed computing: transient workers


Get drake and get help.

Thanks

References